Search for: All records
Total Resources: 5
- Lin, Kevin; Lo, Kyle; Gonzalez, Joseph; Klein, Dan (Association for Computational Linguistics)
- Trienes, Jan; Joseph, Sebastian; Schlötterer, Jörg; Seifert, Christin; Lo, Kyle; Xu, Wei; Wallace, Byron; Li, Junyi Jessy (Association for Computational Linguistics)
- Lin, Kevin; Lo, Kyle; Gonzalez, Joseph_E; Klein, Dan (arXiv)
- Li, Jeffrey; Fang, Alex; Smyrnis, Georgios; Ivgi, Maor; Jordan, Matt; Gadre, Samir; Bansal, Hritik; Guha, Etash; Keh, Sedrick; Arora, Kushal; et al (https://doi.org/10.48550/arXiv.2406.11794)
  The authors introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. DCLM provides a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants can experiment with dataset curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline, the authors find that model-based filtering is critical for assembling a high-quality training set. Their resulting dataset, DCLM-Baseline, enables training a 7B parameter model from scratch to achieve 64% 5-shot accuracy on MMLU with 2.6T training tokens. This represents a 6.6 percentage point improvement over MAP-Neo (the previous state-of-the-art in open-data LMs), while using 40% less compute. The baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%), and performs similarly on an average of 53 NLU tasks, while using 6.6x less compute than Llama 3 8B. These findings emphasize the importance of dataset design for training LMs and establish a foundation for further research on data curation.
  Free, publicly-accessible full text available April 21, 2026

Full Text Available